Abstract

In this paper, we study the problem of estimating a Markov chain X (signal) 
from its noisy partial information Y, when the transition probability kernel 
depends on some unknown parameters. Our goal is to compute the conditional 
distribution process $P\{X_n \mid Y_n, \dots, Y_1\}$, referred to hereafter as the optimal filter.
Following a standard Bayesian technique, we treat the parameters as a non- 
dynamic component of the Markov chain. As a result, the new Markov chain is 
not going to be mixing, even if the original one is. We show that, under certain 
conditions, the optimal filters are still going to be asymptotically stable with 
respect to the initial conditions. Thus, by computing the optimal filter of the 
new system, we can estimate the signal adaptively.

1 Introduction



Stochastic filtering theory is concerned with the estimation of the distribution of a 
stochastic process at any time instant, given some partial information up to that time 
(optimal filter). The basic model usually consists of a Markov chain X (also called 
the state variable) and possibly nonlinear observations Y with observational noise V 
independent of the signal X. In this case, the optimal filter is completely determined 
by the observations, the transition probability kernel, the distribution of the noise, 
and the initial distribution. One of the problems that often comes up in stochastic 
filtering is when one or more of these elements or, more generally, the model is not
exactly known. In this paper, we study the case where the kernel depends on unknown
parameters. 

The other important problem in stochastic filtering is how to compute the optimal 
filter. With the exception of very few cases (for example, linear Gaussian systems), 
an analytic solution does not exist and we have to resort to numerical methods. One 
of the most efficient schemes for the recursive computation of the optimal filter is 
the Interactive Particle Filter (or Sequential Monte Carlo), first suggested in [THj and 
j!9j . independently. The idea is to approximate the optimal filter by an empirical 
distribution of particles which evolve in a way that imitates the evolution of the 
optimal filter. It has been shown that as the number of particles grows, the empirical 
distribution on these particles converges to the optimal filter, at every time instant 
(for theoretical results regarding the convergence of the Interacting Particle Filter 
see, for example, [14], [H], or PHI for a comprehensive review).
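The particle approximation described above can be sketched as follows. This is a minimal bootstrap-style filter written for illustration only, not the scheme of any cited reference: the scalar signal model (an AR(1) chain with coefficient `a`), the observation function `tanh`, the multinomial resampling step, and all function and variable names are assumptions introduced here.

```python
import numpy as np

def bootstrap_particle_filter(ys, sample_init, sample_transition, h, sigma,
                              n_particles, rng):
    """Return one particle cloud per observation; each cloud approximates
    the conditional law of X_n given Y_1, ..., Y_n."""
    particles = sample_init(n_particles, rng)
    clouds = []
    for y in ys:
        # Mutation: move every particle with the signal transition kernel.
        particles = sample_transition(particles, rng)
        # Weighting: Gaussian likelihood g(y - h(x)), computed in log scale.
        log_w = -0.5 * ((y - h(particles)) / sigma) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Selection: multinomial resampling according to the weights.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
        clouds.append(particles)
    return clouds

# Hypothetical test model: X_{n+1} = a X_n + N(0,1), Y_n = tanh(X_n) + V_n.
rng = np.random.default_rng(0)
a, sigma, n_steps = 0.9, 0.5, 50
x, ys = 0.0, []
for _ in range(n_steps):
    x = a * x + rng.normal()
    ys.append(np.tanh(x) + sigma * rng.normal())

clouds = bootstrap_particle_filter(
    ys,
    sample_init=lambda n, r: np.zeros(n),
    sample_transition=lambda p, r: a * p + r.normal(size=p.shape),
    h=np.tanh, sigma=sigma, n_particles=500, rng=rng)
filter_means = np.array([c.mean() for c in clouds])
```

The empirical distribution of each cloud plays the role of the approximate optimal filter at that time; `filter_means` is one functional of it.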

These two problems are combined in the problem of adaptive estimation, i.e. how 
to estimate the parameters while computing the optimal filter. A natural idea is 
to treat the parameter as part of the state variable and then use some variation of 
the Interactive Particle Filter to compute the optimal filter (see j2S] for a historical 
perspective, as well as [28] and jT] for a more recent discussion). In this case, the 
Bayesian posterior distribution of the parameter is a marginal of the optimal filter. 
Even though there is plenty of numerical evidence showing that the posterior distri- 
bution of the parameter will converge to a delta function on the true value, this has 
not been proved yet, to the best of the writer's knowledge. The existing results on 
the consistency of estimators for the parameters of partially observed Markov chains 
concern other kinds of estimators (see and references within for the case of Hidden 
Markov Models, i.e. partially observed Markov models with finite state space, or ^3] 
for results regarding the consistency of the Maximum Likelihood Estimator for more 
general systems). We will show below that the Bayesian estimator is also consistent. 

When we include the parameter in the state variable, the problem of adaptive 
estimation is clearly connected to the problem of asymptotic stability of the optimal 
filter with respect to its initial conditions. The true initial distribution on the second 
component of our state variable (the parameter) is of course a delta function on the 
true value, while we start with some prior distribution. It is also clear that the system 
is not mixing anymore, even if the signal X is. The study of the asymptotic stability 
of the optimal filter is still an active area of research. In fact, many of the existing 
results for ergodic systems (|2Dl)|2Zj)|211) have to be revised, since the recent discovery 
of a gap in a proof of [20] (see [3| and [I]). The question has been resolved for some 
cases as, for example, mixing systems ([2 ) or particular cases of non-mixing systems 
(see |S],jH] 5 M an d EH)- In this paper, we study the asymptotic stability of the optimal 
filters for systems that are not ergodic and whose ergodic components are actually 
mixing, under certain assumptions regarding the continuity of the kernel. 

Yet, as discussed above, usually we can only hope to compute an approximation to 
the optimal filter. This is always the case in adaptive estimation; even if the system is 
linear and Gaussian, the linearity is lost once we enter the parameters in the system. 
Since the error due to the unknown parameter disappears only asymptotically, we need 
a numerical scheme that converges uniformly with respect to time. The Interactive
Particle Filters usually do not satisfy this condition if the system is not ergodic (see
[13]). A class of Particle Filters (the Monte-Carlo Particle Filter) that converges
uniformly under relatively weak conditions is described in ^2] • These filters, though, 
are computationally much more expensive than the Interactive Particle Filters, since 
they require good sampling of the path space. In j2S|, we discuss a numerical scheme 
that converges uniformly to the optimal filter in question and is computationally 
relatively efficient. 

The structure of the paper is the following: In section 2, we define the filtering 
problem we will be studying and state our main assumption, concerning the identifi- 
ability of the parameters. In section 3, we prove that the posterior distribution of the 
parameter will, indeed, converge to a delta function on the true value, under certain 
continuity conditions and for prior distributions on the parameter space whose mass 
in the neighborhood of the true value does not disappear "too fast" . 

In order to prove that the optimal filter is asymptotically stable, we first show that 
it is uniformly continuous with respect to the parameters. The uniform continuity of 
the optimal filter has also been studied in [22], under the assumption that the kernels
are mixing. In that case, it is shown that if the parameter is fixed to a value different
from the true one, the total error in the optimal filter will be uniformly bounded
by the supremum of the step errors - i.e. the errors made when the parameters are 
different just at the last time step - multiplied by some constant that depends on 
the mixing constant. Thus, the uniform continuity of the step errors implies the 
uniform continuity of the optimal filter. In section 4, we review these results and 
give some sufficient conditions for the uniform continuity of the step error to hold, 
that are relatively easy to check. Finally, in section 5, we prove the main result: the 
asymptotic stability of the optimal filter, with respect to the initial conditions, for 
systems that come up when we include the parameters of an otherwise mixing system, 
in the state variable. 

2 Definitions and Assumptions 

Let $E$ be a Polish space, i.e., a complete separable metric space, and let us denote
by $\mathcal{B}(E)$ its Borel $\sigma$-field. We study the asymptotic behavior of the conditional
distribution of a Markov chain $\{X_n\}$ taking values in $E$ given some noisy partial
information, when the kernel depends on an unknown parameter $\theta$. More specifically,
we study the optimal filter of the following system, which we will refer to as:

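System 1. Let $\{X_n\}$ be a homogeneous Markov chain taking values in $(E, \mathcal{B}(E))$.
Let $\mu$ be its initial distribution and $K_\theta$ its transition probability kernel, depending on
a parameter $\theta \in \Theta$. Furthermore, we assume that for each $\theta \in \Theta$, $K_\theta$ is Feller
and mixing, i.e. there exist a constant $0 < \epsilon_\theta \le 1$ and a nonnegative measure
$\lambda_\theta \in \mathcal{M}^+(E)$ ($\mathcal{M}^+(E)$ being the set of finite nonnegative measures on $E$), such that

$$\epsilon_\theta \lambda_\theta(A) \le K_\theta(x, A) \le \frac{1}{\epsilon_\theta} \lambda_\theta(A), \quad \forall x \in E,\ \forall A \in \mathcal{B}(E). \tag{1}$$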



The observation process is defined by 

$$Y_n = h(X_n) + V_n,$$

where the $V_n$ are i.i.d. Gaussian$(0, \sigma^2)$ $\mathbb{R}^p$-valued random variables independent of $X$
and $h : E \to \mathbb{R}^p$ is a bounded continuous function. We denote by $g$ the Gaussian
probability density function of the observational noise.

In practice, the parameter space is usually Euclidean. More generally, we assume 
that it is a Polish space, with metric $d_\Theta(\cdot, \cdot)$. Most problems are given in the form of
System 1. Following a standard Bayesian technique, we rewrite the system, so that 
the parameter becomes part of the Markov chain, whose transition probability kernel 
is now completely known: 

System 2. Suppose that $\{\tilde{X}_n = (X_n, \theta_n)\}$ is an $E \times \Theta$-valued homogeneous Markov
chain, with transition probability

$$\tilde{K}((x', \theta'), dx \otimes d\theta) = K_{\theta'}(x', dx) \otimes \delta_{\theta'}(d\theta)$$

and initial distribution $\mu \otimes u$, where $K_\theta$ is Feller and mixing in the sense of (1). The
observation process is defined by

$$Y_n = \tilde{h}(\tilde{X}_n) + V_n,$$

where $\tilde{h}(\tilde{x}) = h(x)$ for $\tilde{x} = (x, \theta)$, and the process $\{V_n\}_{n \ge 0}$ is defined as above.

System 2 can be thought of as a generalization of System 1 and will be the main 
object of study in this paper. Our goal is to show the asymptotic stability of the 
optimal filter of this system with respect to its initial distribution. 
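To make the augmented system concrete, the following sketch computes the optimal filter of the pair (signal, parameter) exactly in a toy setting where both the state space and the parameter set are finite. Everything specific here is an assumption made for illustration: the two-state chain, the kernel family `kernel(theta)`, the three-point parameter grid, and the uniform prior. The parameter component is never moved, which mirrors the delta kernel in the definition of System 2.

```python
import numpy as np

def augmented_filter(ys, kernels, mu, prior, h_vals, sigma):
    """Exact filter for (X_n, theta) on a finite state space and a finite
    parameter grid. Returns phi with phi[j, i] = P(theta_j, X_n = i | Y_1..Y_n)."""
    phi = np.outer(prior, mu)  # initial distribution: prior (x) mu
    for y in ys:
        # Prediction: X moves with K_theta, theta stays where it is.
        phi = np.stack([phi[j] @ kernels[j] for j in range(len(kernels))])
        # Correction: multiply by the Gaussian likelihood and normalize.
        phi = phi * np.exp(-0.5 * ((y - h_vals) / sigma) ** 2)
        phi /= phi.sum()
    return phi

def kernel(theta):
    # Hypothetical kernel family with strictly positive entries (hence mixing).
    return np.array([[1.0 - theta, theta], [0.5, 0.5]])

thetas = [0.2, 0.5, 0.8]       # assumed parameter grid
true_index = 1                 # data are generated with theta = 0.5
h_vals = np.array([-1.0, 1.0])
sigma = 0.5
mu = np.array([0.5, 0.5])

rng = np.random.default_rng(1)
x = rng.choice(2, p=mu)
ys = []
for _ in range(2000):
    x = rng.choice(2, p=kernel(thetas[true_index])[x])
    ys.append(h_vals[x] + sigma * rng.normal())

phi = augmented_filter(ys, [kernel(t) for t in thetas], mu,
                       np.full(3, 1.0 / 3.0), h_vals, sigma)
theta_posterior = phi.sum(axis=1)  # marginal of the filter on the parameter
signal_filter = phi.sum(axis=0)    # marginal of the filter on the signal
```

Here `theta_posterior` is the Bayesian posterior of the parameter, obtained as a marginal of the filter of the augmented chain, and `signal_filter` is the adaptive estimate of the signal.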

The canonical space of the Markov chain $X$ with kernel $K_\theta$ and initial distribution
$\mu$ is denoted by $(\Omega_1 = E^{\mathbb{N}}, (\mathcal{F}^X_n)_{n \ge 0}, P_{\mu,\theta})$, where $\mathcal{F}^X_n = \sigma(X_0, X_1, \dots, X_n)$ is the
$\sigma$-algebra generated by the random variables $X_0, X_1, \dots, X_n$. Similarly, the observation
process is defined on the canonical space $(\Omega_2 = (\mathbb{R}^p)^{\mathbb{N}}, (\mathcal{F}^Y_n)_{n \ge 0}, Q_{\mu,\theta})$, where
$\mathcal{F}^Y_n = \sigma(Y_1, \dots, Y_n)$. The law of the observation process $Q_{\mu,\theta}$ is given by

$$Q_{\mu,\theta}(dy_{k_1}, \dots, dy_{k_n}) = \int_{E^{\otimes n}} \prod_{i=1}^{n} g(y_{k_i} - h(x_{k_i})) \, P_{\mu,\theta}(dx_{k_1}, \dots, dx_{k_n}) \, dy_{k_1} \cdots dy_{k_n},$$

for any $n > 0$ and $k_1, \dots, k_n \in \mathbb{N}_+$, where $E^{\otimes n} = E \times \cdots \times E$ is the product space
of $n$ copies of $E$. Also, by $Q^n_{\mu,\theta}$ we denote the restriction of the measure $Q_{\mu,\theta}$ to the
$\sigma$-algebra $\mathcal{F}^Y_n$.

We can now define the pair process $(X, Y)$ on the space $(\Omega = \Omega_1 \times \Omega_2, (\mathcal{F}_n =
\mathcal{F}^X_n \times \mathcal{F}^Y_n)_{n \ge 0}, \mathbb{P}_{\mu,\theta})$, where the measure $\mathbb{P}_{\mu,\theta}$ is such that its marginal distributions
with respect to $X$ and $Y$ are $P_{\mu,\theta}$ and $Q_{\mu,\theta}$ respectively. It is not hard to show that
this measure exists (see, for example, [12]). We will denote the expectation with
respect to $\mathbb{P}_{\mu,\theta}$ by $\mathbb{E}_{\mu,\theta}$.

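Similarly, we define the triplet $(X, Y, \theta)$ on the space $(\tilde{\Omega} = \Omega \times \Omega_3, (\tilde{\mathcal{F}}_n = \mathcal{F}_n \times
\sigma(\theta))_{n \ge 0}, \mathbb{P}_{\mu,u})$, where $\theta$ is a $\Theta$-valued random variable defined on $(\Omega_3, \sigma(\theta), u)$ and
the marginals of $\mathbb{P}_{\mu,u}$ on $(X, Y)$ and $\theta$ respectively are $\int_\Theta \mathbb{P}_{\mu,\theta} \, u(d\theta)$ and $u$. We will
denote the expectation with respect to $\mathbb{P}_{\mu,u}$ by $\mathbb{E}_{\mu,u}$.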

Furthermore, we denote by $\Psi^\theta_n(\mu)$ and $\Phi_n(\mu \otimes u)$ the optimal filters for Systems
1 and 2 with initial distributions $\mu$ and $\mu \otimes u$, respectively, defined as the posterior
distribution of the state variable given the observations. The name "optimal filters" is
due to the fact that they are the best estimators adapted to the available information
(the $\sigma$-algebra generated by the observations), with respect to the $L^2$-norm. They
are random measures on the spaces $E$ and $E \times \Theta$ respectively, defined as follows: for
every $f \in C_b(E \times \Theta)$,



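$$\Phi_n(\mu \otimes u)(f) = \mathbb{E}_{\mu,u}[f(X_n, \theta) \mid Y_n, \dots, Y_1] = \frac{\int_\Theta \int_{E^{\otimes n}} f(x_n, \theta) \prod_{k=1}^{n} g(Y_k - h(x_k)) \, P_{\mu,\theta}(dx_1, \dots, dx_n) \, u(d\theta)}{\int_\Theta \int_{E^{\otimes n}} \prod_{k=1}^{n} g(Y_k - h(x_k)) \, P_{\mu,\theta}(dx_1, \dots, dx_n) \, u(d\theta)}. \tag{2}$$

Similarly, $\forall f' \in C_b(E)$,

$$\Psi^\theta_n(\mu)(f') = \mathbb{E}_{\mu,\theta}[f'(X_n) \mid Y_n, \dots, Y_1] = \frac{\int_{E^{\otimes n}} f'(x_n) \prod_{k=1}^{n} g(Y_k - h(x_k)) \, P_{\mu,\theta}(dx_1, \dots, dx_n)}{\int_{E^{\otimes n}} \prod_{k=1}^{n} g(Y_k - h(x_k)) \, P_{\mu,\theta}(dx_1, \dots, dx_n)}. \tag{3}$$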

Clearly, $\Psi^\alpha_n(\mu)$ is the marginal of $\Phi_n(\mu \otimes \delta_\alpha)$ with respect to the state variable
$X$, since $\Phi_n(\mu \otimes \delta_\alpha)(f) = \Psi^\alpha_n(\mu)(f_\alpha)$, where we used the notation $f_\alpha(x) = f(x, \alpha)$.
Similarly, we define $\Psi^u_n(\mu)$ as the marginal of $\Phi_n(\mu \otimes u)$ with respect to $X$, i.e.,

$$\Psi^u_n(\mu)(dx) = \int_\Theta \Phi_n(\mu \otimes u)(dx, d\theta).$$

We want to find sufficient conditions for $\Psi^u_n(\mu)$ to be asymptotically stable with
respect to the initial distribution $u$. First, we need to check the "identifiability" of
the parameters.

The mixing condition (1) implies the ergodicity of the signal. We denote by $\mu_\theta$ the
limiting distribution of the Markov chain whose transition kernel is $K_\theta$. The measure
$\mu_\theta$ is uniquely defined in this way, as a result of the ergodic property of the signal.
By $\nu_\theta$ we denote the limiting distribution of the observation process corresponding to
parameter value $\theta$. It is easy to check that this is also uniquely defined by:

$$\nu_\theta = (\mu_\theta \circ h^{-1}) * g. \tag{4}$$

We define an equivalence relation on the parameter space as follows:

$$\alpha \sim \beta \iff \mu_\alpha \circ h^{-1} = \mu_\beta \circ h^{-1}. \tag{5}$$

Since $g$ is the Gaussian probability density function, the following definition is equivalent to (5):

$$\alpha \sim \beta \iff \nu_\alpha = \nu_\beta. \tag{6}$$

We assume that there is no pair of equivalent points in the parameter space. This 
implies that the observation processes corresponding to two different parameter val- 
ues are mutually singular. Otherwise, we might not be able to tell two parameter 
values apart by looking at the observations. A trivial example is when h is constant. 
Problems can also arise when h is symmetric. 
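The constant-$h$ example can be checked numerically for a finite chain. The sketch below is illustrative only: the two-state kernel family `kernel(theta)` and the helper names are assumptions, and the check is restricted to the image of the stationary law under $h$, which is the quantity compared in the equivalence relation (5).

```python
import numpy as np

def stationary(K):
    """Stationary distribution of an irreducible transition matrix K,
    obtained by solving pi K = pi together with sum(pi) = 1."""
    n = K.shape[0]
    A = np.vstack([K.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def image_law(pi, h_vals):
    """Law of h(X) when X is distributed according to pi."""
    values = np.unique(h_vals)
    masses = np.array([pi[h_vals == v].sum() for v in values])
    return values, masses

def kernel(theta):
    # Hypothetical kernel family; its stationary law depends on theta.
    return np.array([[1.0 - theta, theta], [0.5, 0.5]])

constant_h = np.array([0.0, 0.0])    # h constant: observations carry no information
injective_h = np.array([-1.0, 1.0])  # h separates the two states

laws_constant = [image_law(stationary(kernel(t)), constant_h)[1] for t in (0.2, 0.8)]
laws_injective = [image_law(stationary(kernel(t)), injective_h)[1] for t in (0.2, 0.8)]

equivalent_under_constant_h = np.allclose(laws_constant[0], laws_constant[1])
equivalent_under_injective_h = np.allclose(laws_injective[0], laws_injective[1])
```

With the constant observation function the two parameter values are equivalent in the sense of (5), so they cannot be identified; with the injective one they are not equivalent.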

Identifiability condition: By saying that "the identifiability condition holds" or
that "the parameters are identifiable", we mean that $\alpha \neq \beta$ implies $\alpha \not\sim \beta$.

Remark 2.1. The assumption that the noise $\{V_n\}_{n \ge 0}$ of the observation process is
Gaussian is not really necessary. It just implies the equivalence between definitions
(5) and (6). For arbitrary observational noise, one just has to define the equivalence
relation using (6) and all the following results will also hold, with the appropriate
changes in notation.



3 Consistency of Bayesian estimator 

In this section, we study the behavior of the posterior distribution of the parameter,
given the observations. We show that under certain conditions and for almost every
realization $y = \{y_1, \dots, y_n, \dots\}$ of the observation process described by System 1
corresponding to a fixed value $\alpha$ of the parameter ($\theta = \alpha$), the posterior distribution
of the parameter $\mathbb{P}_{\mu,u}(\theta \mid y)$, where $u$ is the prior distribution, is a delta function on $\alpha$.

If we only assume that the identifiability condition holds, we can show that for
$u$-a.e. $\alpha$,

$$Q_\alpha(\mathbb{P}_{\mu,u}(\theta \mid y) = \delta_\alpha) = 1, \tag{7}$$

meaning that for almost every realization $\alpha$ of the parameter value with respect to
the probability measure $u$ and almost every realization $y$ of the observation process
with respect to the measure $Q_\alpha$, the posterior distribution of the parameters given
the observation is going to be a delta function on $\alpha$. Equivalently, this can be stated
as follows:

$$u\{\alpha \in \Theta : Q_\alpha\{y : \mathbb{P}_{\mu,u}(\theta \mid y) = \delta_\alpha\} = 1\} = 1.$$

This is a consequence of a lemma by Glotzl and Wakolbinger jTTj, where they use 
the notion of ergodic decomposability. It is a promising result, but we actually need 
something stronger. Since we assume that there exists a "true value" for the param- 
eter which is fixed but unknown, we have to be sure that whatever this value is, the 
result will hold. So, we would like to prove (7) for every possible value of the parameter.
Below, we prove a result for the posterior distribution that is slightly weaker than
(7), for any "reasonable prior" (this will be made more precise below) and under some
additional assumptions. This result, however, will be sufficient for our purposes.

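Lemma 3.1. Let $Y$ be as described in System 1. We further assume that

• the prior distribution $u$ on the parameter space is such that there exist a sequence
$\epsilon_n \downarrow 0$ and a function $p : \mathbb{N} \to [1, \infty)$ with the following properties:

$$\frac{p(n)}{n} \to 0, \quad \text{as } n \to \infty \tag{8}$$

and

$$\sup_{n} \mathbb{E}_{\mu,\alpha}\left[\left(\frac{\sup_{\theta \in \mathcal{N}_{\epsilon_n}(\alpha)} \frac{dQ^n_\alpha}{dQ^n_\theta}(Y_n, \dots, Y_1)}{u(\mathcal{N}_{\epsilon_n}(\alpha))}\right)^{\frac{1}{p(n)}}\right] < +\infty; \tag{9}$$

• for each $\eta > 0$, there exist an $\epsilon > 0$ and an $I_\eta > 0$ such that

$$\limsup_{n \to \infty} \frac{1}{n} \log \int_{\mathcal{N}_\eta(\alpha)^c} \mathbb{P}_{\mu,\theta}(L_n(Y) \in B_\epsilon(\nu_\alpha)) \, u(d\theta) \le -I_\eta < 0, \tag{10}$$

where $L_n(Y)$ is the empirical measure of the observation process up to time $n$,
i.e. $L_n(Y) = \frac{1}{n} \sum_{k=1}^{n} \delta_{Y_k}$.

Then, for every $\eta > 0$,

$$\lim_{n \to \infty} \mathbb{E}_{\mu,\alpha} \, \mathbb{P}_{\mu,u}(\theta \in \mathcal{N}_\eta(\alpha)^c \mid Y_n, \dots, Y_1) = 0. \tag{11}$$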

n^oo 

Note that we have used the notation $\mathcal{N}_\rho(\alpha)$ to denote the ball of radius $\rho$ and center
$\alpha$ with respect to the metric $d_\Theta$, for any $\rho > 0$. Similarly, by $B_\epsilon(\nu_\alpha)$ we denote the
ball of radius $\epsilon$ and center $\nu_\alpha$ with respect to the Lévy-Prohorov metric. Also, by $A^c$
we denote the complement of any set $A$.

The condition on the prior says that the mass around the true value $\alpha$ should not
go to zero "too quickly"; how quickly depends on how fast the measure
$Q_\theta$ approaches $Q_\alpha$ when $\theta$ goes to $\alpha$. This condition becomes clearer when applied
to specific models.

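The proof of Lemma 3.1 is given below:

Proof of Lemma 3.1. Let us fix an $\eta > 0$. Then, we can choose $\epsilon > 0$ so that (10)
holds. We break the expectation in two parts:

$$\mathbb{E}_{\mu,\alpha} \, \mathbb{P}_{\mu,u}(\theta \in \mathcal{N}_\eta(\alpha)^c \mid Y_n, \dots, Y_1) = A_n + B_n,$$

where

$$A_n = \mathbb{E}_{\mu,\alpha}\left[1_{B_\epsilon(\nu_\alpha)}(L_n(Y)) \, \mathbb{P}_{\mu,u}(\theta \in \mathcal{N}_\eta(\alpha)^c \mid Y_n, \dots, Y_1)\right]$$

and

$$B_n = \mathbb{E}_{\mu,\alpha}\left[1_{(B_\epsilon(\nu_\alpha))^c}(L_n(Y)) \, \mathbb{P}_{\mu,u}(\theta \in \mathcal{N}_\eta(\alpha)^c \mid Y_n, \dots, Y_1)\right].$$

Clearly

$$\lim_{n \to \infty} B_n \le \lim_{n \to \infty} \mathbb{P}_{\mu,\alpha}(L_n(Y) \in (B_\epsilon(\nu_\alpha))^c) = 0, \quad \forall \epsilon > 0,$$

by Birkhoff's ergodic theorem. It remains to show that $\lim_{n \to \infty} A_n = 0$. Writing $Q^n_u = \int_\Theta Q^n_\theta \, u(d\theta)$ for the law of $(Y_1, \dots, Y_n)$ under $\mathbb{P}_{\mu,u}$, we have

$$\begin{aligned}
A_n &= \mathbb{E}_{\mu,u}\left[1_{B_\epsilon(\nu_\alpha)}(L_n(Y)) \, \frac{dQ^n_\alpha}{dQ^n_u}(Y_n, \dots, Y_1) \, \mathbb{E}_{\mu,u}\left[1_{\mathcal{N}_\eta(\alpha)^c}(\theta) \mid Y_n, \dots, Y_1\right]\right] \\
&= \mathbb{E}_{\mu,u}\left[1_{B_\epsilon(\nu_\alpha)}(L_n(Y)) \, 1_{\mathcal{N}_\eta(\alpha)^c}(\theta) \, \frac{dQ^n_\alpha}{dQ^n_u}(Y_n, \dots, Y_1)\right] \\
&\le \left(\mathbb{E}_{\mu,u}\left[\left(1_{B_\epsilon(\nu_\alpha)}(L_n(Y)) \, 1_{\mathcal{N}_\eta(\alpha)^c}(\theta)\right)^{p(n)+1}\right]\right)^{\frac{1}{p(n)+1}} \left(\mathbb{E}_{\mu,u}\left[\left(\frac{dQ^n_\alpha}{dQ^n_u}\right)^{\frac{p(n)+1}{p(n)}}\right]\right)^{\frac{p(n)}{p(n)+1}} \\
&= \left(\mathbb{E}_{\mu,u}\left[1_{B_\epsilon(\nu_\alpha)}(L_n(Y)) \, 1_{\mathcal{N}_\eta(\alpha)^c}(\theta)\right]\right)^{\frac{1}{p(n)+1}} \left(\mathbb{E}_{\mu,\alpha}\left[\left(\frac{dQ^n_\alpha}{dQ^n_u}\right)^{\frac{1}{p(n)}}\right]\right)^{\frac{p(n)}{p(n)+1}}.
\end{aligned}$$

To go from line 1 to line 2 above we used the tower property of the expectations and
the adaptedness of the terms outside the conditional expectation to the $\sigma$-algebra
$\sigma(Y_n, \dots, Y_1)$. To go from line 2 to line 3 we applied the Hölder inequality with
$p = p(n) + 1$ and $q = \frac{p(n)+1}{p(n)}$. The first term can be written as

$$\exp\left(\frac{n}{p(n)+1} \cdot \frac{1}{n} \log \int_{\mathcal{N}_\eta(\alpha)^c} \mathbb{P}_{\mu,\theta}(L_n(Y) \in B_\epsilon(\nu_\alpha)) \, u(d\theta)\right)$$

and it goes to zero as $n \to \infty$, by (8) and (10). The second term will be uniformly
bounded, since

$$\mathbb{E}_{\mu,\alpha}\left[\left(\frac{dQ^n_\alpha}{dQ^n_u}\right)^{\frac{1}{p(n)}}\right] \le \mathbb{E}_{\mu,\alpha}\left[\left(\frac{\sup_{\theta \in \mathcal{N}_{\epsilon_n}(\alpha)} \frac{dQ^n_\alpha}{dQ^n_\theta}(Y_n, \dots, Y_1)}{u(\mathcal{N}_{\epsilon_n}(\alpha))}\right)^{\frac{1}{p(n)}}\right]$$

and the right-hand side is bounded uniformly in $n$ by (9). Thus, $\lim_{n \to \infty} A_n = 0$, which completes the proof. □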

From the proof of Lemma 3.1 one can easily get an estimate for the rate of convergence
of (11). If the Markov chain for $\theta = \alpha$ satisfies the LDP, then

$$\limsup_{n \to \infty} \frac{p(n)+1}{n} \log \mathbb{E}_{\mu,\alpha} \, \mathbb{P}_{\mu,u}(\theta \in \mathcal{N}_\eta(\alpha)^c \mid Y_n, \dots, Y_1) \le -I_\eta < 0. \tag{12}$$

Note that if the prior $u$ puts positive mass on the true value $\alpha$, i.e. $u(\{\alpha\}) > 0$, we
can choose $\epsilon_n = 0$ and $p(n) = 1$. Then, the rate of convergence will be exponential
(in the sense of (12)).

We would now like to find conditions for (10) to hold that only depend on the
properties of the kernels $K_\theta$. If $u(\mathcal{N}_\eta(\alpha)^c) = 0$ it holds trivially, so we assume that
$u(\mathcal{N}_\eta(\alpha)^c) > 0$. Let us also assume that for each $\theta \in \Theta$, the Markov chain satisfies
the LDP with rate function $I_\theta$. This is really just a property of the kernels $K_\theta$ (see,
for example, ^B], for the properties the kernel has to satisfy in order for the Markov
chain to satisfy the LDP). Then, by the contraction principle, the observation process
$\{Y_n\}_{n \ge 0}$ will also satisfy the LDP with a good rate function $J_\theta$, given by

J e {v) := mi{mi {KeT :»K=n} J E / B log ^^(y)K(x, dy)n(dx) + (13) 
+ mi W€Vm} J RP \og^(x)v'(dx); fi G V{E),v G P(W) and v=(jio h' 1 ) * v} 

where $\mathcal{T}$ is the class of all transition kernels on $E$. To show (10), it suffices to show
that

$$\limsup_{n \to \infty} \sup_{\theta \in \mathcal{N}_\eta(\alpha)^c} \frac{1}{n} \log \mathbb{P}_{\mu,\theta}(L_n(Y) \in B_\epsilon(\nu_\alpha)) \le -I_\eta < 0.$$

If the parameter space is compact and the distribution $\mathbb{P}_{\mu,\theta}$ is continuous with respect
to $\theta$, we can interchange the limit and the supremum and, consequently, it suffices to
show that for each $\theta \in \mathcal{N}_\eta(\alpha)^c$,

$$\limsup_{n \to \infty} \frac{1}{n} \log \mathbb{P}_{\mu,\theta}(L_n(Y) \in B_\epsilon(\nu_\alpha)) \le -I_\eta < 0.$$

Since the observation process satisfies the LDP for each $\theta$, we know that

$$\limsup_{n \to \infty} \frac{1}{n} \log \mathbb{P}_{\mu,\theta}(L_n(Y) \in B_\epsilon(\nu_\alpha)) \le -J_\theta(\bar{B}_\epsilon(\nu_\alpha)), \quad \forall \theta \in \Theta,$$

where by $\bar{B}_\epsilon(\nu_\alpha)$ we denote the closure of $B_\epsilon(\nu_\alpha)$. We need to find an $\epsilon > 0$, such that

$$\inf_{\theta \in \mathcal{N}_\eta(\alpha)^c} J_\theta(\bar{B}_\epsilon(\nu_\alpha)) > 0. \tag{14}$$

Let us also assume that the rate function $J_\theta$ is continuous with respect to $\theta$. Then,
the compactness of the parameter space and the properties of rate functions imply
that (14) will be true if for each $\theta \in \mathcal{N}_\eta(\alpha)^c$, $\nu_\theta \notin \bar{B}_\epsilon(\nu_\alpha)$. Equivalently, we ask that

$$\forall \eta > 0, \ \exists \epsilon > 0 : \ (L(\nu_\theta, \nu_\alpha) \le \epsilon \implies d_\Theta(\theta, \alpha) < \eta), \tag{15}$$

where by $L(\cdot, \cdot)$ we denote the Lévy-Prohorov metric. This holds if the mapping from
the parameter space to the space of limiting distributions of the observation process
is open (i.e. open sets are mapped to open sets).

Note that the continuity of the kernel $K_\theta$ with respect to the parameter $\theta$ implies
the continuity of $\mathbb{P}_{\mu,\theta}$ with respect to $\theta$. So, we have found conditions for (10) to hold
that only involve properties of the kernels. These are summarized in the following

Lemma 3.2. Suppose that the parameter space is compact and the following hold:

• For each $\theta \in \Theta$, the observation process $\{Y_n\}_{n \ge 0}$ satisfies the LDP with rate
function $J_\theta$.

• The mapping $\Theta \ni \theta \mapsto \nu_\theta \in \mathcal{P}(\mathbb{R}^p)$ is an open mapping, in the sense of (15).

• The kernel $K_\theta$ and the rate function $J_\theta$ are continuous with respect to $\theta$.

Then, (10) will also hold.

4 Uniform continuity of the optimal filter

In this section, we study uniform continuity of the optimal filter, with respect to the
parameter. This problem was first studied in [22], where it was shown that if the
kernel is mixing and the step errors converge to zero uniformly, then the total error is
also going to converge to zero uniformly. We review these results here and give some
sufficient conditions for the step error to be uniformly continuous. We first review
the definition of the "Hilbert projective metric" in the space of measures:

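Definition 4.1. Two nonnegative measures $\mu, \mu' \in \mathcal{M}^+(E)$ are comparable, if
$\alpha \mu \le \mu' \le \beta \mu$ for suitable positive scalars $\alpha$ and $\beta$. The Hilbert projective metric
on $\mathcal{M}^+(E)$ is defined as

$$h(\mu, \mu') := \begin{cases} \log \dfrac{\sup_{A : \mu'(A) > 0} \mu(A)/\mu'(A)}{\inf_{A : \mu'(A) > 0} \mu(A)/\mu'(A)} = \log\left(\left\|\dfrac{d\mu}{d\mu'}\right\|_\infty \left\|\dfrac{d\mu'}{d\mu}\right\|_\infty\right), & \text{if } \mu \text{ and } \mu' \text{ are comparable} \\ 0, & \text{if } \mu = \mu' = 0 \\ +\infty, & \text{otherwise.} \end{cases} \tag{16}$$

The kernel norm corresponding to the Hilbert metric is called the Birkhoff contraction
coefficient $\tau(K)$:

$$\tau(K) := \sup_{0 < h(\mu, \mu') < \infty} \frac{h(K\mu, K\mu')}{h(\mu, \mu')}. \tag{17}$$

Convergence of probability measures in the Hilbert projective metric is stronger
than convergence in total variation. In fact, the following inequality holds (see [2]):

$$\|\mu - \mu'\|_{tv} \le \frac{2}{\log 3} \, h(\mu, \mu').$$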
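For measures on a finite set the Hilbert projective metric reduces to a maximum and a minimum over the atoms, which makes the definition easy to experiment with. The sketch below is a hedged illustration: the function names are introduced here, the total variation norm is taken to be the sum of absolute differences (total mass of the signed measure), and the random probability vectors are only a sanity check of the comparison with total variation, not a proof.

```python
import numpy as np

def hilbert_metric(mu, nu):
    """Hilbert projective metric between two nonnegative vectors
    (measures on a finite set)."""
    mu = np.asarray(mu, dtype=float)
    nu = np.asarray(nu, dtype=float)
    if not mu.any() and not nu.any():
        return 0.0                       # both measures are zero
    if np.any((mu > 0) != (nu > 0)):
        return np.inf                    # not comparable: different supports
    ratio = mu[mu > 0] / nu[nu > 0]
    return float(np.log(ratio.max() / ratio.min()))

def tv_norm(mu, nu):
    """Total variation norm of mu - nu, taken as the sum of absolute differences."""
    return float(np.abs(np.asarray(mu, float) - np.asarray(nu, float)).sum())

rng = np.random.default_rng(0)
pairs = [(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))) for _ in range(1000)]
worst_ratio = max(tv_norm(m, n) / hilbert_metric(m, n) for m, n in pairs)
constant = 2.0 / np.log(3.0)
```

The metric is projective: rescaling either argument by a positive constant leaves it unchanged, which is why it is well suited to unnormalized filters.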

The following result gives a uniform bound, in terms of the step errors, to the total
variation distance of optimal filters corresponding to different parameters, one being
equal to the true parameter value $\alpha$. The proof of this estimate is given in [22]. We
have slightly altered the notation there, in order to make it consistent with the rest
of this paper.

Lemma 4.2 ([22], Cor. 4.5). Suppose that the kernel $K_\alpha$ is mixing with some $\epsilon > 0$.
We define the step error with respect to the Hilbert projective metric as follows:

$$\delta^H_n(\theta, \alpha) = h(\Psi^\theta_n(\mu), K^\alpha_n(\Psi^\theta_{n-1}(\mu))), \tag{18}$$

where $K^\alpha_n$ is the transition kernel at time $n$ of the optimal filter corresponding to
parameter $\alpha$, i.e.

$$K^\alpha_n(\mu)(dx) = \frac{\int_E g(Y_n - h(x)) K_\alpha(x', dx) \mu(dx')}{\int_E \int_E g(Y_n - h(z)) K_\alpha(z', dz) \mu(dz')}. \tag{19}$$

Then, the total error is uniformly bounded in total variation by

$$\sup_{n \ge 0} \|\Psi^\theta_n(\mu) - \Psi^\alpha_n(\mu)\|_{tv} \le \frac{2}{\epsilon^2 \log 3} \sup_{n \ge 0} \delta^H_n(\theta, \alpha), \tag{20}$$

where $\|\cdot\|_{tv}$ is the total variation norm on the space of measures.

Note that the quantities appearing in (20) are all random. It would have been more
accurate to write

$$\sup_{n \ge 0} \mathbb{E}\left[\|\Psi^\theta_n(\mu) - \Psi^\alpha_n(\mu)\|_{tv} \mid Y_n, \dots, Y_1\right] \le \frac{2}{\epsilon^2 \log 3} \sup_{n \ge 0} \mathbb{E}\left[\delta^H_n(\theta, \alpha) \mid Y_n, \dots, Y_1\right],$$

i.e. the inequality (20) holds for every fixed realization $(y_n, \dots, y_1)$ of the observation
process. To simplify the notation, we will avoid writing this conditional expectation
each time. The comparisons between random quantities (depending on the observation
process) that follow are meant to hold for each fixed realization.

The proof of Lemma 4.2 is quite intuitive: the total error is written as the sum
of local errors made when the kernels of the optimal filters are different only at one 
time step. Yet, the earlier the error occurs, the smaller the contribution of the local 
error to the total error, since the optimal filter "corrects itself" . This self-correcting 
property is a consequence of the mixing property of the kernel, which guarantees that 
the contribution of the local errors to the total error is going to decay exponentially 
fast as time evolves and thus, the total error which is their sum is going to be bounded. 
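The self-correcting behaviour can be observed directly in a small finite-state example. In the sketch below, two filters are run on the same observations; they differ only in that one of them uses a wrong kernel at the very first step. The three-state kernels `K_alpha` and `K_theta`, the observation values and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

def filter_step(pi, K, y, h_vals, sigma):
    """One step of the optimal filter on a finite state space:
    prediction with K, then correction with the Gaussian likelihood."""
    pred = pi @ K
    post = pred * np.exp(-0.5 * ((y - h_vals) / sigma) ** 2)
    return post / post.sum()

# Hypothetical kernels; all entries of K_alpha are positive, so it is mixing.
K_alpha = np.array([[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]])
K_theta = np.array([[0.2, 0.2, 0.6], [0.6, 0.2, 0.2], [0.2, 0.6, 0.2]])
h_vals = np.array([-1.0, 0.0, 1.0])
sigma = 0.5

rng = np.random.default_rng(2)
x, ys = 0, []
for _ in range(30):
    x = rng.choice(3, p=K_alpha[x])
    ys.append(h_vals[x] + sigma * rng.normal())

exact = np.array([0.7, 0.2, 0.1])
perturbed = exact.copy()
gaps = []
for n, y in enumerate(ys):
    exact = filter_step(exact, K_alpha, y, h_vals, sigma)
    # The perturbed filter makes a single local error, at the first step only.
    perturbed = filter_step(perturbed, K_theta if n == 0 else K_alpha, y, h_vals, sigma)
    gaps.append(float(np.abs(exact - perturbed).sum()))
```

The list `gaps` records the total variation distance between the two filters over time; the contribution of the single early error shrinks geometrically as further observations are processed.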

More specifically, the proof is based on decomposing the total error into the errors
made at each time step and then using Birkhoff's contraction coefficient to get a
bound for the total error with respect to the step errors (18). That is, the error is
decomposed to:

$$\Psi^\theta_n(\mu) - \Psi^\alpha_n(\mu) = \sum_{k=1}^{n} \left[K^\alpha_{k+1,n}(\Psi^\theta_k(\mu)) - K^\alpha_{k+1,n}(K^\alpha_k(\Psi^\theta_{k-1}(\mu)))\right],$$

where $K^\alpha_{k+1,n} = K^\alpha_{k+1} \circ \cdots \circ K^\alpha_n$ is the $k$-to-$n$ transition kernel of the optimal filter
corresponding to parameter $\alpha \in \Theta$. Then, the following inequality that connects the
total variation norm and the Hilbert projective metric is used:

$$\|K\mu - K\mu'\|_{tv} \le \frac{2}{\log 3} \, \tau(K) \, h(\mu, \mu'), \quad \forall K \in \mathcal{K}(E) \text{ and } \mu, \mu' \in \mathcal{P}(E),$$

where $\tau$ is Birkhoff's contraction coefficient and $\mathcal{K}(E)$ is the space of transition
kernels on $E$. The result follows by the fact that $\tau(K^\alpha_{k+1,n}) \le \left(\frac{1-\epsilon^2}{1+\epsilon^2}\right)^{n-k}$ (which is a
consequence of the mixing property of $K_\alpha$ with mixing constant $\epsilon > 0$).
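The contraction induced by mixing can be checked numerically for a strictly positive stochastic matrix. The sketch below is an illustration under stated assumptions: the matrix is random, the reference measure is taken to be the geometric mean of the column minimum and maximum (one convenient choice, which yields the constant computed in `mixing_constant`), and the comparison uses the bound $(1-\epsilon^2)/(1+\epsilon^2)$ on the Birkhoff coefficient of a single mixing kernel.

```python
import numpy as np

def hilbert_metric(mu, nu):
    """Hilbert projective metric for strictly positive vectors."""
    ratio = mu / nu
    return float(np.log(ratio.max() / ratio.min()))

def mixing_constant(K):
    """A constant eps such that eps * lam[j] <= K[i, j] <= lam[j] / eps for all i, j,
    with the reference measure lam[j] = sqrt(min_i K[i, j] * max_i K[i, j])."""
    col_min, col_max = K.min(axis=0), K.max(axis=0)
    return float(np.sqrt((col_min / col_max).min()))

rng = np.random.default_rng(3)
K = rng.uniform(0.1, 1.0, size=(4, 4))
K /= K.sum(axis=1, keepdims=True)      # make K a stochastic matrix

eps = mixing_constant(K)
bound = (1.0 - eps ** 2) / (1.0 + eps ** 2)

worst = 0.0
for _ in range(1000):
    mu, nu = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    # Measures act on the kernel from the left: (mu K)(j) = sum_i mu[i] K[i, j].
    worst = max(worst, hilbert_metric(mu @ K, nu @ K) / hilbert_metric(mu, nu))
```

The observed contraction ratio `worst` stays below `bound`; composing $n - k$ such steps gives the geometric decay used in the proof.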

A similar result holds if we assume that the step errors are uniformly bounded
with respect to the total variation norm instead of the Hilbert projective metric.
Once again, the following estimate comes from [22]. We rewrite it so that it fits in
the setting of this paper.

Lemma 4.3 ([22], Cor. 4.7). Suppose that the kernel $K_\alpha$ is mixing with some $\epsilon > 0$.
Let

$$\delta^{tv}_n(\theta, \alpha) := \|\Psi^\theta_n(\mu) - K^\alpha_n(\Psi^\theta_{n-1}(\mu))\|_{tv}, \tag{21}$$

where $K^\alpha_n$ is defined as in (19). Then, the total variation norm of the total error is
uniformly bounded by

$$\sup_{n \ge 0} \|\Psi^\theta_n(\mu) - \Psi^\alpha_n(\mu)\|_{tv} \le \left(1 + \frac{2}{\epsilon^4 \log 3}\right) \sup_{n \ge 0} \delta^{tv}_n(\theta, \alpha). \tag{22}$$

The proof of the above lemma is based on the following estimate that provides an
upper bound for the Hilbert projective metric in terms of the total variation norm:
if the nonnegative kernel $K$ is mixing, then for any nonzero finite measures $\mu, \mu' \in
\mathcal{M}^+(E)$:

$$h(K\mu, K\mu') \le \frac{1}{\epsilon^2} \left\|\frac{\mu}{\mu(E)} - \frac{\mu'}{\mu'(E)}\right\|_{tv}.$$

It is now a straightforward corollary that the uniform convergence of the step
errors (18) and (21) implies the uniform continuity of the optimal filters, in the total
variation norm.

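Corollary 4.4. Suppose that the Hilbert projective metric of the step errors converges
to zero uniformly with respect to parameter $\theta$:

$$\lim_{\theta \to \alpha} \sup_{n \ge 0} h(\Psi^\theta_n(\mu), K^\alpha_n(\Psi^\theta_{n-1}(\mu))) = 0. \tag{23}$$

Then, the optimal filters are uniformly continuous with respect to $\theta$, in the total
variation norm:

$$\lim_{\theta \to \alpha} \sup_{n \ge 0} \|\Psi^\theta_n(\mu) - \Psi^\alpha_n(\mu)\|_{tv} = 0. \tag{24}$$

Uniform continuity of the optimal filter with respect to the parameter (24) also holds
if

$$\lim_{\theta \to \alpha} \sup_{n \ge 0} \|\Psi^\theta_n(\mu) - K^\alpha_n(\Psi^\theta_{n-1}(\mu))\|_{tv} = 0. \tag{25}$$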

The problem now becomes how to show the uniform continuity of the step errors. 
It is easy to show the following 

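Lemma 4.5. Suppose that for every $\theta, \alpha \in \Theta$ and $x' \in E$, the probability measures
$K_\theta(x', \cdot)$ and $K_\alpha(x', \cdot)$ are absolutely continuous with respect to each other. If, in
addition, there exist functions $c(\theta, \alpha)$ and $d(\theta, \alpha)$ such that

$$0 < c(\theta, \alpha) \le \frac{dK_\theta(x', \cdot)}{dK_\alpha(x', \cdot)}(x) \le c(\theta, \alpha) \exp(d(\theta, \alpha)) < +\infty \tag{26}$$

then the Hilbert projective distance of the step error is uniformly bounded by

$$h(\Psi^\theta_n(\mu), K^\alpha_n(\Psi^\theta_{n-1}(\mu))) \le \log\left(\sup_{x', x} \frac{dK_\theta}{dK_\alpha}(x', x) \, \sup_{x', x} \frac{dK_\alpha}{dK_\theta}(x', x)\right) =: h(K_\theta, K_\alpha) \le d(\theta, \alpha).$$

Consequently, if $\lim_{\theta \to \alpha} d(\theta, \alpha) = 0$, then the step error converges uniformly, i.e. (23)
holds.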
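For finite kernels the quantity $h(K_\theta, K_\alpha)$ is an entrywise computation, and the bound on the step error can be checked by brute force. The sketch below is an illustration only: the two random stochastic matrices stand in for $K_\theta$ and $K_\alpha$, and the observation model and helper names are assumptions introduced here.

```python
import numpy as np

def hilbert_metric(mu, nu):
    """Hilbert projective metric for strictly positive vectors."""
    ratio = mu / nu
    return float(np.log(ratio.max() / ratio.min()))

def kernel_distance(K_theta, K_alpha):
    """log( sup dK_theta/dK_alpha * sup dK_alpha/dK_theta ) for positive matrices."""
    ratio = K_theta / K_alpha
    return float(np.log(ratio.max() * (1.0 / ratio).max()))

def filter_step(pi, K, y, h_vals, sigma):
    """Prediction with K followed by correction with the Gaussian likelihood."""
    post = (pi @ K) * np.exp(-0.5 * ((y - h_vals) / sigma) ** 2)
    return post / post.sum()

rng = np.random.default_rng(4)

def random_kernel():
    K = rng.uniform(0.1, 1.0, size=(3, 3))
    return K / K.sum(axis=1, keepdims=True)

K_theta, K_alpha = random_kernel(), random_kernel()
h_vals = np.array([-1.0, 0.0, 1.0])
sigma = 0.5

bound = kernel_distance(K_theta, K_alpha)
worst_step_error = 0.0
for _ in range(1000):
    pi = rng.dirichlet(np.ones(3))   # stands in for the previous filter value
    y = rng.normal()                 # an arbitrary observation
    step_error = hilbert_metric(filter_step(pi, K_theta, y, h_vals, sigma),
                                filter_step(pi, K_alpha, y, h_vals, sigma))
    worst_step_error = max(worst_step_error, step_error)
```

The step error never exceeds `bound`, whatever the previous filter value and the observation, because the likelihood factor cancels in the projective metric.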
The advantage of this lemma is that condition (26) only involves the kernels $K_\theta$
and $K_\alpha$ and can be easily checked, but it is too restrictive.

The next lemma gives sufficient conditions for (25) to hold that are less restrictive
than (26), by making use of the Mean Value Theorem for real-valued functions defined
on a Banach space. For the rest of this section, we assume that the parameter space
is a Banach space with norm $\|\cdot\|$.

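Lemma 4.6. Suppose that there exists a neighborhood $\mathcal{N}(\alpha)$ of $\alpha$, such that the
Fréchet derivative of the kernel $K_\theta$ with respect to the weak operator norm exists for
every $\theta \in \mathcal{N}(\alpha)$, i.e. $\exists L_\theta : C_b(E) \to C_b(E)$ such that

$$\lim_{h \to 0} \frac{|\mu(K_{\theta+h} f) - \mu(K_\theta f) - \mu(L_\theta f) h|}{\|h\|} = 0, \quad \forall \mu \in \mathcal{P}(E) \text{ and } f \in C_b(E).$$

Then, for $\theta$ sufficiently close to $\alpha$ ($\theta \in \mathcal{N}(\alpha)$),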

where by g n we denote the function g n (x) = g(Y n — h(x)). As an immediate conse- 
quence, we have that 

- KS(«J_x(M))IU <2\\e- all S 1 ^]^ - (28) 
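Proof. For every $n > 0$ and $\theta \in \Theta$, we define a real-valued function $F_{n,\theta} : \Theta \to \mathbb{R}$ by

$$F_{n,\theta}(\theta') = K^{\theta'}_n(\Psi^\theta_{n-1}(\mu))(f) = \frac{\Psi^\theta_{n-1}(\mu)(K_{\theta'}(f g_n))}{\Psi^\theta_{n-1}(\mu)(K_{\theta'} g_n)},$$

for some fixed $\theta \in \mathcal{N}(\alpha)$ and $f \in C_b(E)$. Then,

$$|\Psi^\theta_n(\mu)(f) - K^\alpha_n(\Psi^\theta_{n-1}(\mu))(f)| = |F_{n,\theta}(\theta) - F_{n,\theta}(\alpha)|.$$

It is not hard to see that if the Fréchet derivative of the kernel exists then the
Fréchet derivative $F'_{n,\theta}$ of $F_{n,\theta}$ also exists and is equal to

$$F'_{n,\theta}(\theta') = \frac{\Psi^\theta_{n-1}(\mu)(L_{\theta'}(f g_n)) \, \Psi^\theta_{n-1}(\mu)(K_{\theta'} g_n) - \Psi^\theta_{n-1}(\mu)(K_{\theta'}(f g_n)) \, \Psi^\theta_{n-1}(\mu)(L_{\theta'} g_n)}{\left(\Psi^\theta_{n-1}(\mu)(K_{\theta'} g_n)\right)^2},$$

which can be easily bounded by:

$$\begin{aligned}
|F'_{n,\theta}(\theta')| &= \frac{\left|\int_{E^4} (f(x) - f(z)) g_n(x) g_n(z) L_{\theta'}(x', dx) K_{\theta'}(z', dz) \Psi^\theta_{n-1}(\mu)(dx') \Psi^\theta_{n-1}(\mu)(dz')\right|}{\left(\Psi^\theta_{n-1}(\mu)(K_{\theta'} g_n)\right)^2} \\
&\le \frac{\sup_{x,z} |f(x) - f(z)| \int_{E^3} \left|\int_E g_n(x) L_{\theta'}(x', dx)\right| g_n(z) K_{\theta'}(z', dz) \Psi^\theta_{n-1}(\mu)(dx') \Psi^\theta_{n-1}(\mu)(dz')}{\left(\Psi^\theta_{n-1}(\mu)(K_{\theta'} g_n)\right)^2} \\
&\le 2 \|f\|_\infty \frac{\Psi^\theta_{n-1}(\mu)(|L_{\theta'} g_n|)}{\Psi^\theta_{n-1}(\mu)(K_{\theta'} g_n)}.
\end{aligned} \tag{29}$$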
By the Mean Value Theorem (see, for example, ,7J, p. 122), there exists a $\theta' \in
\{\alpha + t(\theta - \alpha) : 0 < t < 1\}$, such that

$$F_{n,\theta}(\theta) - F_{n,\theta}(\alpha) = F'_{n,\theta}(\theta')(\theta - \alpha). \tag{30}$$

Finally, (27) and (28) follow from (29) and (30). □

Usually, the parameter space is Euclidean and, thus, the Frechet derivatives 
coincide with the "usual derivatives" of real-valued functions defined on the Euclidean 
space. We can also get different bounds for the Frechet derivative. Suppose that the 
function 

Mx)=snp\ dLe f;-\ (x)\ (31) 
is well defined. Then it is easy to see that 

\KA 6 ')\ < 2 ll/lloo%^W^T^ = nf\\oo^ ( n A en(MX n )\Y n , F0, (32) 

where we denote by $\mathbb{P}_{\mu,(n,\theta,\theta')}$ the distribution of a Markov chain with initial
distribution $\mu$ and transition kernel $K_\theta$ up to time $n - 1$ and $K_{\theta'}$ at times $\ge n$, and by
$\mathbb{E}_{\mu,(n,\theta,\theta')}$ the respective expectation.

So, the uniform boundedness of (J29)) or (J32)) would imply the uniform continuity of 
the optimal filters with respect to the parameter, in the total variation norm. Working 
with ()32|) can be more intuitive, even though it is actually stronger than (J2TJJ) . Note, 
though, that the function A e / is not assumed to be bounded (this would have been too 
restrictive), so getting a uniform bound for ()32|) . for any given observation process, 
is not always possible. However, if we assume that the state space E is compact 
then the continuity of [ dK,^J ) ( x ) 1 with respect to x and x' would imply the uniform 
continuity of the optimal filters, in the sense of (J2U)- 

The above discussion is summarized in the following 

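Corollary 4.7. Under the mixing assumption for $K_\alpha$ and for $\theta$ sufficiently close to
$\alpha$ ($\theta \in \mathcal{N}(\alpha)$), there exists a $\theta' \in \mathcal{N}(\alpha)$ so that

$$\|\Phi^{\theta}_n(\mu) - \Phi^{\alpha}_n(\mu)\|_{tv} \le 2\left(1 + \frac{1}{\epsilon^4 \log 3}\right) \|\theta - \alpha\|\, \frac{\Phi^{\theta}_{n-1}(\mu)(|L_{\theta'} g_n|)}{\Phi^{\theta}_{n-1}(\mu)(K_{\theta'} g_n)}.$$

If the function $\Lambda_{\theta'}$ is well defined, then

$$\|\Phi^{\theta}_n(\mu) - \Phi^{\alpha}_n(\mu)\|_{tv} \le 2\left(1 + \frac{1}{\epsilon^4 \log 3}\right) \|\theta - \alpha\|\, \mathbb{E}_{\mu,(n,\theta,\theta')}(\Lambda_{\theta'}(X_n) \mid Y_n, \dots, Y_1).$$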

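Consequently, if there exists an $M$ such that

$$\sup_{n>0}\ \sup_{\theta,\theta' \in \mathcal{N}(\alpha)} \frac{\Phi^{\theta}_{n-1}(\mu)(|L_{\theta'} g_n|)}{\Phi^{\theta}_{n-1}(\mu)(K_{\theta'} g_n)}\,(Y_1(\omega), \dots, Y_n(\omega)) < M \tag{33}$$

for any realization $\omega \in \Omega$ and a sufficiently small neighborhood $\mathcal{N}(\alpha)$ of $\alpha$, then (24)
holds.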
Note, however, that for the asymptotic stability of the optimal filter to hold it is
sufficient to show the almost sure uniform continuity of the optimal filters, i.e.

$$\lim_{\theta \to \alpha}\ \sup_{n>0}\ \mathbb{E}_{\mu,\alpha} \|\Phi^{\theta}_n(\mu) - \Phi^{\alpha}_n(\mu)\|_{tv} = 0. \tag{34}$$

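As before, for this to be true it is sufficient to show that

$$\sup_{n>0}\ \sup_{\theta,\theta' \in \mathcal{N}(\alpha)} \mathbb{E}_{\mu,\alpha}\, \frac{\Phi^{\theta}_{n-1}(\mu)(|L_{\theta'} g_n|)}{\Phi^{\theta}_{n-1}(\mu)(K_{\theta'} g_n)} < +\infty$$

or, if $\Lambda_{\theta'}$ is well defined, that

$$\sup_{n>0}\ \sup_{\theta,\theta' \in \mathcal{N}(\alpha)} \mathbb{E}_{\mu,\alpha}\, \mathbb{E}_{\mu,(n,\theta,\theta')}(\Lambda_{\theta'}(X_n) \mid Y_n, \dots, Y_1) < \infty. \tag{35}$$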

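Again, this last condition is much easier to work with. We write

$$\begin{aligned}
\mathbb{E}_{\mu,\alpha}\, \mathbb{E}_{\mu,(n,\theta,\theta')}(\Lambda_{\theta'}(X_n) \mid Y_n, \dots, Y_1)
&= \mathbb{E}_{\mu,(n,\theta,\theta')}\left( \Lambda_{\theta'}(X_n) \cdot \frac{d\mathbb{P}_{\mu,\alpha}}{d\mathbb{P}_{\mu,(n,\theta,\theta')}}(Y_n, \dots, Y_1) \right) \\
&\le \left( \mathbb{E}_{\mu,\theta}\, K_{\theta'} \Lambda_{\theta'}^{\,n+1}(X_{n-1}) \right)^{\frac{1}{n+1}} \cdot \left( \mathbb{E}_{\mu,\alpha} \left( \frac{d\mathbb{P}_{\mu,\alpha}}{d\mathbb{P}_{\mu,(n,\theta,\theta')}}(Y_n, \dots, Y_1) \right)^{\frac{1}{n}} \right)^{\frac{n}{n+1}}.
\end{aligned} \tag{36}$$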
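
The inequality in (36) is an instance of Hölder's inequality with the conjugate exponents $n+1$ and $(n+1)/n$, followed by a change of measure. The sketch below spells this out under the assumption that the law of $(Y_1, \dots, Y_n)$ under $\mathbb{P}_{\mu,\alpha}$ is absolutely continuous with respect to its law under $\mathbb{P}_{\mu,(n,\theta,\theta')}$; the symbols $Q$, $P$ and $Z_n$ are shorthand introduced only for this illustration.

```latex
% Assumed shorthand: Q := \mathbb{P}_{\mu,(n,\theta,\theta')}, \quad P := \mathbb{P}_{\mu,\alpha},
% \quad Z_n := \frac{dP}{dQ}(Y_n,\dots,Y_1), \quad \Lambda := \Lambda_{\theta'}.
\begin{align*}
\mathbb{E}_Q\bigl( \Lambda(X_n)\, Z_n \bigr)
  &\le \bigl( \mathbb{E}_Q \Lambda(X_n)^{n+1} \bigr)^{\frac{1}{n+1}}
       \bigl( \mathbb{E}_Q Z_n^{\frac{n+1}{n}} \bigr)^{\frac{n}{n+1}},
  \qquad \tfrac{1}{n+1} + \tfrac{n}{n+1} = 1, \\
\mathbb{E}_Q \Lambda(X_n)^{n+1}
  &= \mathbb{E}_{\mu,\theta}\, K_{\theta'} \Lambda^{n+1}(X_{n-1})
  \qquad \text{(the last step of the chain uses } K_{\theta'} \text{)}, \\
\mathbb{E}_Q Z_n^{\frac{n+1}{n}}
  &= \mathbb{E}_Q\bigl( Z_n \cdot Z_n^{\frac{1}{n}} \bigr)
   = \mathbb{E}_P\, Z_n^{\frac{1}{n}} .
\end{align*}
```

The last line is where the exponent $1/n$ in the second factor comes from: one power of $Z_n$ is absorbed by switching the expectation from $Q$ to $P$.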

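It is easy to see that

$$\frac{d\mathbb{P}_{\mu,\alpha}}{d\mathbb{P}_{\mu,(n,\theta,\theta')}}(Y_n, \dots, Y_1) \le \frac{1}{\mathbb{E}_{\mu,(n,\theta,\theta')}\left( e^{-\frac{1}{2}\sum_{k=1}^{n}(Y_k - h(X_k))^2} \mid Y_n, \dots, Y_1 \right)} \le e^{\frac{1}{2}\sum_{k=1}^{n} \mathbb{E}_{\mu,(n,\theta,\theta')}\left( (Y_k - h(X_k))^2 \mid Y_n, \dots, Y_1 \right)},$$

by Jensen's inequality and the fact that $e^{-\frac{1}{2}\sum_{k=1}^{n}(Y_k - h(X_k))^2} \le 1$. Consequently,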

But since the observation process is ergodic and the function $h$ is bounded, this is also
going to be bounded, provided that the first two moments of the limiting distribution
$\nu_\alpha$ exist. This proves the following

Lemma 4.8. Suppose that the function $\Lambda_\theta(x)$ defined in (31) is well-defined and

$$\sup_{n>0}\ \sup_{\theta,\theta' \in \mathcal{N}(\alpha)} \left( \mathbb{E}_{\mu,\theta}\, K_{\theta'} \Lambda_{\theta'}^{\,n+1}(X_{n-1}) \right)^{\frac{1}{n+1}} < +\infty, \tag{37}$$

for some sufficiently small neighborhood $\mathcal{N}(\alpha)$ of $\alpha$. Then, if $K_\alpha$ is mixing and
the first two moments of $\nu_\alpha$ exist, the optimal filters will almost surely be uniformly
continuous with respect to the parameters, i.e. (34) will hold.

The importance of this lemma is that its assumptions depend only on properties of
the kernels. We can now prove the asymptotic stability of the optimal filter of system
(2) with respect to the initial conditions.
5 Asymptotic stability of non-mixing systems

In general, the asymptotic behavior of the optimal filter is not well understood, even
when the system is ergodic (see jHj and [I], regarding the gap in the proof of the
ergodicity of the optimal filters, in [20], and the consequences). If the state space is
finite ([S]), or the kernel is mixing Q2J, it has been shown that the optimal filters will
be asymptotically stable. In fact, Chigansky and Liptser, in [H], recently showed that
the optimal filter will also be stable under conditions that are weaker than mixing
but stronger than ergodicity of the signal.

In the following lemma, we prove the asymptotic stability of the optimal filter for
system (2), which is a non-mixing system.

Lemma 5.1. Let $Y$ be as in system (2) and suppose that the assumptions of Lemma
3.2 are satisfied. We also assume the uniform continuity of the optimal filters, in the
sense of (34). Then

$$\lim_{n \to \infty} \mathbb{E}_{\mu,\alpha} \|\Phi_n(\mu \otimes u) - \Phi^{\alpha}_n(\mu')\|_{tv} = 0, \tag{38}$$

for any initial distributions $\mu, \mu' \in \mathcal{P}(E)$.

Proof. First, we note that it is sufficient to show (38) for the same initial distributions,
since the rest follows by the asymptotic stability of the optimal filters for mixing
kernels. That is, we want to show

$$\lim_{n \to \infty} \mathbb{E}_{\mu,\alpha} \|\Phi_n(\mu \otimes u) - \Phi^{\alpha}_n(\mu)\|_{tv} = 0.$$

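We decompose the optimal filters as follows:

$$\begin{aligned}
\Phi_n(\mu \otimes u)(dx, d\theta) &= \mathbb{P}_{\mu,u}(X_n \in dx,\ \theta_n \in d\theta \mid Y_n, \dots, Y_1) \\
&= \mathbb{P}_{\mu,u}(X_n \in dx \mid Y_n, \dots, Y_1,\ \theta_n = \theta)\, \mathbb{P}_{\mu,u}(\theta_n \in d\theta \mid Y_n, \dots, Y_1) \\
&= \mathbb{P}_{\mu,\theta}(X_n \in dx \mid Y_n, \dots, Y_1)\, \mathbb{P}_{\mu,u}(\theta_n \in d\theta \mid Y_n, \dots, Y_1) \\
&= \Phi^{\theta}_n(\mu)(dx)\, Z^u_n(d\theta),
\end{aligned}$$
where $Z^u_n(d\theta) = \mathbb{P}_{\mu,u}(\theta_n \in d\theta \mid Y_n, \dots, Y_1)$. So,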



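$$\mathbb{E}_{\mu,\alpha} \|\Phi_n(\mu \otimes u) - \Phi^{\alpha}_n(\mu)\|_{tv} \le \mathbb{E}_{\mu,\alpha} \left\| \int_\Theta \left( \Phi^{\theta}_n(\mu) - \Phi^{\alpha}_n(\mu) \right) Z^u_n(d\theta) \right\|_{tv} \le \mathbb{E}_{\mu,\alpha} \int_\Theta \|\Phi^{\theta}_n(\mu) - \Phi^{\alpha}_n(\mu)\|_{tv}\, Z^u_n(d\theta). \tag{39}$$

Since we have assumed the uniform continuity of the optimal filters (34), $\forall \epsilon > 0$ we
can find a neighborhood $\mathcal{N}_\eta(\alpha)$ of $\alpha$ for some $\eta > 0$, such that

$$\forall n > 0, \quad \sup_{\theta \in \mathcal{N}_\eta(\alpha)} \mathbb{E}_{\mu,\alpha} \|\Phi^{\theta}_n(\mu) - \Phi^{\alpha}_n(\mu)\|_{tv} < \frac{\epsilon}{2}.$$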
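Also, by Lemma 3.2 we can find an $n_0$ such that

$$\forall n > n_0, \quad \mathbb{E}_{\mu,\alpha}\, Z^u_n(\mathcal{N}_\eta(\alpha)^c) < \frac{\epsilon}{4},$$

where $\mathcal{N}_\eta(\alpha)^c$ is the complement of $\mathcal{N}_\eta(\alpha)$. So, putting the last two estimates together,
we get that $\forall n > n_0$

$$\mathbb{E}_{\mu,\alpha} \|\Phi_n(\mu \otimes u) - \Phi^{\alpha}_n(\mu)\|_{tv} \le \sup_{\theta \in \mathcal{N}_\eta(\alpha)} \mathbb{E}_{\mu,\alpha} \|\Phi^{\theta}_n(\mu) - \Phi^{\alpha}_n(\mu)\|_{tv} + 2\, \mathbb{E}_{\mu,\alpha}\, Z^u_n(\mathcal{N}_\eta(\alpha)^c) < \epsilon,$$
which proves (38). □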
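
The decomposition used in the proof, namely that the filter of the augmented chain factors into the fixed-parameter filter times the posterior of the parameter, can be checked numerically on a toy model. The sketch below is not taken from the paper: the two-state kernel, the finite parameter grid, the choice $h(x) = x$ with standard Gaussian noise, and all function names are assumptions made purely for illustration.

```python
import math
import random

def kernel(theta):
    # Hypothetical two-state kernel K_theta: flip the state with probability theta.
    return [[1.0 - theta, theta], [theta, 1.0 - theta]]

def g(y, x):
    # Unnormalised Gaussian likelihood for Y = h(X) + V, assuming h(x) = x, V ~ N(0, 1).
    return math.exp(-0.5 * (y - x) ** 2)

def filter_step(phi, theta, y):
    # One step of the fixed-parameter filter; also returns the normalising constant.
    K = kernel(theta)
    unnorm = [g(y, x) * sum(phi[xp] * K[xp][x] for xp in range(2)) for x in range(2)]
    c = sum(unnorm)
    return [w / c for w in unnorm], c

def joint_step(pi, thetas, y):
    # One step of the filter of the augmented chain (X_n, theta), with theta static.
    unnorm = []
    for j, theta in enumerate(thetas):
        K = kernel(theta)
        unnorm.append([g(y, x) * sum(pi[j][xp] * K[xp][x] for xp in range(2))
                       for x in range(2)])
    total = sum(sum(row) for row in unnorm)
    return [[w / total for w in row] for row in unnorm]

def run(n_steps=50, alpha=0.2, thetas=(0.1, 0.2, 0.4), seed=0):
    # Simulate observations under the true parameter alpha and compare, at every
    # step, the joint filter with the product (fixed-theta filter) x (posterior).
    rng = random.Random(seed)
    m = len(thetas)
    mu = [0.5, 0.5]                 # initial distribution of the signal
    u = [1.0 / m] * m               # uniform prior on the parameter grid
    phis = [list(mu) for _ in range(m)]
    z = list(u)
    pi = [[u[j] * mu[i] for i in range(2)] for j in range(m)]
    K_true = kernel(alpha)
    state = 0
    max_gap = 0.0
    for _ in range(n_steps):
        state = 1 if rng.random() < K_true[state][1] else 0
        y = state + rng.gauss(0.0, 1.0)
        pi = joint_step(pi, thetas, y)
        weights = []
        for j in range(m):
            phis[j], c = filter_step(phis[j], thetas[j], y)
            weights.append(z[j] * c)
        total = sum(weights)
        z = [w / total for w in weights]   # posterior of theta given Y_1..Y_n
        for j in range(m):
            for i in range(2):
                max_gap = max(max_gap, abs(pi[j][i] - phis[j][i] * z[j]))
    return max_gap, z
```

Calling `run()` returns the largest discrepancy observed between the two sides of the decomposition, which should be at the level of floating-point rounding, together with the final posterior weights on the parameter grid.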

In the following theorem, we prove the asymptotic stability of the optimal filters
of system (2) under assumptions that only involve the kernels $K_\theta$.

Theorem 5.2. Suppose that $Y$ is as in system (2) and that the parameter space is a
compact Banach space. We also assume the following:

1. The prior distribution $u$ on the parameter space is such that there exist a sequence
$\epsilon_n \downarrow 0$ and a function $p : \mathbb{N} \to [1, \infty)$ satisfying JSJ), so that

limE Mia [( < +oo. 

nt°o ^ U{Ne n {a)) 

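2. For each $\theta \in \Theta$, the observation process $\{Y_n\}_{n>0}$ satisfies the LDP with rate
function $J_\theta$.

3. The mapping $\Theta \ni \theta \mapsto \nu_\theta \in \mathcal{P}(\mathbb{R}^p)$ is an open mapping, in the sense of

4. The kernel $K_\theta$ and the rate function $J_\theta$ are continuous with respect to $\theta$.

5. The function $\Lambda_\theta(x)$ defined in (31) is well-defined and

$$\sup_{n>0}\ \sup_{\theta,\theta' \in \mathcal{N}(\alpha)} \left( \mathbb{E}_{\mu,\theta}\, K_{\theta'} \Lambda_{\theta'}^{\,n+1}(X_{n-1}) \right)^{\frac{1}{n+1}} < +\infty,$$

for some sufficiently small neighborhood $\mathcal{N}(\alpha)$ of $\alpha$.

6. The first two moments of $\nu_\alpha$ exist.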

Then, the optimal filter of system (2) will eventually correct itself, i.e. it will satisfy (38).

Proof. Just combine Lemmas 3.2, 4.8 and 5.1. □